Sensor Link, Iterated-scatter-gather, and Parcelation (SLIP) Technology
Creating and Visualizing a Citation Index: Exercise I
December 16, 2001
Obtaining Informational Transparency with Selective Attention
Dr. Paul S. Prueitt
President, OntologyStream Inc
Section 1 is copied from a self-contained three-page overview of the SLIP-I-RIB Technology. Sections 2 and 3 are an exercise designed for the beginner. Software for the exercise is available at the OSI download index.
This exercise is about the first application of SLIP Technology to full text mining.
We begin this tutorial with an acknowledgement. Cedar Tree Software has largely been responsible for the development of three KOS (Knowledge Operating System) Browsers for the OSI process architecture. Under the direction of Don Mitchell, a joint project was conceived in support of OSI’s consulting work on an Incident Management and Intrusion Detection System (IMIDS). OSI consulted with several third parties in an effort to develop a state-of-the-art IMIDS. However, the work with Cedar Tree Software was by far the most productive.
The concept of a KOS has evolved over five months of collaboration between Mitchell and OSI founder Paul Prueitt. A small (< 50K) operating shell was developed to provide all of the properties that the three SLIP Browsers share in common. Commonality is also sought for a voice-activated state-gesture interface between a human and a small finite state machine. The finite state machine houses a control ontology that consists of grammar, methods that delegate commands, and a response mechanism that includes visual and auditory responses. This small operating shell is called the Root_KOS.
December 2001 saw the first phase of the IMIDS development work come to a close. The conclusion of the first phase renewed the client’s interest in SLIP. Unfortunately, the renewed interest came at a time when end-of-year R&D budgets were being cut. OSI developed a Summary of Possibilities in order to lay out the architecture for IMIDS and to state the case that the R&D should be completed and a full deployment of the new technology then made.
A pause in funding was taken as an opportunity for OSI and its partners to re-examine the processes whereby innovation is developed and then hopefully deployed. We also made an internal commitment to complete the still unfinished Event Browser. We decided to develop the software in the public view and to reveal most of the algorithmic innovations related to the use of In-memory Referential Information Bases (I-RIBs).
By December 5th, 2001 OSI had generalized the model for event log analysis, and Cedar Tree quickly made these generalizations available in the SLIP Warehouse and SLIP Technology Browsers. On December 7th, an exercise on importing an arbitrary event log was made available to the public.
The term "Sensor" replaced the term "Shallow" on December 10th, 2001. The new data mining technology started to be referred to as Sensor Link, Iterated-scatter-gather and Parcelation (or SLIP).
OSI’s data mining technology is based on link analysis, emergent computing, and category theory. The first suite of software applications is used to model computer hacker/cracker incident events that are distributed in location and time, and to model computer (and infrastructure) vulnerabilities. This IMIDS technology is fully operational and available for demonstration.
A standard link relationship is definable by the user using small Browsers. Each Browser is less than 350K in size and has no installation procedure.
Patterns revealed in the link relationship are used to define location- and time-distributed "events". These events are visualized as clusters and then as pictures that look like chemical compounds.
Figure 1 (a, b): Two elementary types exist, atoms and links
It is felt that the SLIP Technologies provide ready-to-use data mining and data visualization tools.
· Both atoms and links are abstractions taken from the actual data invariance that exists in the data source. The data source is any event audit. (A minimal sketch of this abstraction is given after this list.)
· Automated conversion of the event chemistry to finite state transition models (colored Petri nets) is possible. This conversion will push automation from the Browsers into an Intrusion Detection System (IDS) or any distributed event detection system (DEDS), such as the network trouble ticket analysis systems deployed in telecommunications infrastructure.
· A theory of state transition and behavioral analysis is available and is to be applied (by OSI) to creating templated profiles of opposition activity and intentions. The social science involved is the subject of a PhD dissertation and of scholarship by members of the BCNGroup Inc, a foundation that supports basic research on the behavioral and computational neurosciences.
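As a concrete illustration of the first bullet above, the following Python sketch shows one way atoms and links can be abstracted from an event audit. It is a minimal sketch, not the Browsers’ code: the record fields are invented for the example, and the rule that two values become linked when they co-occur in a record is an assumption drawn from the description above.

    # Minimal sketch (Python): atoms and links abstracted from an event log.
    # Assumption: each audit record is a set of field values, and two values
    # become "atoms" joined by a "link" when they occur in the same record.
    # The field names and values below are illustrative only.
    from collections import defaultdict
    from itertools import combinations

    records = [
        {"src_ip": "10.0.0.5", "dst_port": "80"},
        {"src_ip": "10.0.0.5", "dst_port": "443"},
        {"src_ip": "10.0.0.9", "dst_port": "80"},
    ]

    atoms = set()                 # elementary type 1: atoms (data invariances)
    links = defaultdict(int)      # elementary type 2: links (co-occurring pairs)

    for rec in records:
        values = sorted(rec.values())
        atoms.update(values)
        for a, b in combinations(values, 2):
            links[(a, b)] += 1    # count how often the pair co-occurs

    print(len(atoms), "atoms;", len(links), "links")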
A critical issue in IMIDS has to do with the prediction of events before they occur, or the identification of an event while the event is occurring.
Human analysis based on the viewing of event chemistry will be predictive in three ways:
1) The human will have a cognitive aid for thinking about, and talking with peers about, the events and event types.
2) A top-down expectancy is provided for pattern completion of partially developed event chemistry.
3) Coherency testing separates viewpoints into distinct graphic pictures, and this provides informational transparency with a selective attention directed by user voice commands.
Although the SLIP-I-RIB technology was developed for seven regional Computer Emergency Response Teams (CERTs), the small Warehouse Browser will take ANY event log and allow the user to define any link analysis relationship. The Technology Browser produces clustered visualizations of the linkage over any small or large dataset. The Event Browser will produce two layers of event chemistry in correspondence to event atoms and event compounds. The Enterprise IMIDS (under development now) will push small mobile automation controllers (stand-alone programs) from the desktop into distributed IDS and DEDS components.
In early December 2001, Mitchell and Prueitt spent a few days talking about the computer science based on .NET Visual Basic and C# and the theoretical work based on a model of diffusion processes. The notes from this discussion are available in the first exercise on the Event Browser (Part 2).
SLIP is complementary to knowledge-based systems. OSI is able to deploy a chosen knowledge sharing system and the SLIP-I-RIB technology using a deployment compliance model that is under development. Any one of several enterprise knowledge sharing systems is readily deployable along with the SLIP-I-RIB technology.
A process model for any such deployment has been under development. The process model is simpler than the SW-CMM model for software procurement, and reflects modern Knowledge Management practices developed at George Washington University and by several leading process theorists.
OSI has long had an interest in developing a SW-CMM-type compliance model for the adoption of knowledge technologies. In 1990, SW-CMM was put forward as a Business Process Re-engineering type model to govern government procurement of software. This model has evolved to where it now governs quite a lot of the Federal government’s acquisition of software and software consulting services.
We propose that the sponsorship of basic innovation in knowledge technology is not as functional as our social needs would require. A process model for the development of knowledge technology innovation is needed. The development and deployment of the SLIP-I-RIB Technologies is following such a process model.
Section 2: The first concept maps
The Event Browser scatters the atoms from a category into an object space. This space can be rendered in various ways. The first iteration of object space rendering is shown in Figure 2. These renderings were produced on December 15th, 2001.
Figure 2: Rendering of atom objects in the SLIP Object Space
A number of issues were recognized during the development of the rendering process. Some possible solutions to these issues are suggested in Section 3 of this Exercise.
Let us start with the data source. The Warehouse Browser needs a datawh.txt file. By downloading the zip file ecI.zip, one can examine a datawh.txt file that has 2,918 records, each record having two columns.
Figure 3: The Analytic Conjecture for the Fable Collection
The first of the two columns has token values and the second column has the name of one of the 332 short stories. The average length of a fable is about 200 words.
After unzipping ecI.zip, remove all contents of the Data Folder except for datawh.txt. Then launch the Warehouse Browser and enter the following commands: a = 1; b = 0; pull; export. These four commands will produce Figure 3.
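Before running the Browser commands, it can be reassuring to check that datawh.txt has the expected shape. The short Python sketch below is not part of the SLIP distribution; it assumes the two columns are tab-delimited, which is an assumption since the delimiter is not stated in this exercise.

    # Minimal sketch (Python): sanity-check datawh.txt before the "pull".
    # Assumption: columns are tab-delimited; change DELIM if they are not.
    DELIM = "\t"

    tokens, fables = [], []
    with open("datawh.txt") as f:
        for line in f:
            parts = line.rstrip("\n").split(DELIM)
            if len(parts) < 2:
                continue
            tokens.append(parts[0])   # column 1: token value
            fables.append(parts[1])   # column 2: fable (short story) name

    print(len(tokens), "records")               # expected: 2,918
    print(len(set(fables)), "distinct fables")  # expected: 332
    print(len(set(tokens)), "distinct tokens")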
The development of the fable collection goes back to 1996 when Prueitt suggested that any autonomous declassification system would have to be capable of doing what he referred to as fable arithmetic. He pointed out that the mosaic effect would reveal hidden relationships and provide access to declassified concepts.
Fable arithmetic exists if we have a formal system that is able to add or subtract fables and produce a fable. The addition of two fables would have to be a fable that has all of the concepts present in the two fables, and is about the same size. The new story would be judged, by a child, to be an Aesop fable. The subtraction of one fable from another would have to have all of the concepts of the first fable except those concepts that exist in the second fable. No new concepts could be added, and yet the fable should be of the proper size and pass for a fable in the eyes of a child.
Well, of course such a system does not yet exist. In 2000, Prueitt began a conversation with M-CAM Inc (www.m-cam.com) over the possibility of annotating patents and patent applications through an automated means. One technology that almost does this is the Dr. Link technology that was at the time available from TextWise Inc and Manning & Napier Information Systems. Dr. Liz Liddy (Syracuse University) developed a system for text analysis based on Peircean graphs and linguistic analysis. Several other systems were available, the most important of which was the Oracle ConText engine, purchased by Oracle from Artificial Linguistics in about 1989. None of these systems has survived the attention span of the marketing community, in spite of capabilities that are clearly needed by intelligence analysts.
Linguistic analysis of text using deep case grammar is clearly the best technology basis for autonomous rendering of the concepts in text. The second best technology is clearly latent semantic indexing. Then comes statistical word frequency analysis, which unfortunately is the most popular technology. Under a tutored hand, n-gram analysis can outperform statistical word frequency analysis.
Autonomy Inc and N-Corp Inc have proved that a profile-based push-pull information technology can make an impact in the marketplace. Expectations from Autonomy clients soared in 1999 and 2000, only to collapse in 2001. The problem has been that the core of the Autonomy engine is based on statistical word frequency analysis. Given any one of the better technologies for rendering concepts, the Autonomy system’s performance would increase. Of course this is an OSI claim that has not been demonstrated experimentally. However, we have proposed a process for making such experimental determinations. The proposal is in a White Paper written for OSD in 2000.
One more issue should be mentioned. OSI claims that the voting procedure will out-perform the Autonomy engine. This procedure is based on Russian quasi-axiomatic logic and the semiotic theory of reflective control, and is exceedingly simple. The voting procedure has also been used in a prototype of a distance learning system developed and reviewed by the State Department in 1998.
Section 2.2: The Technology Browser
By looking at the properties window of the Warehouse Browser, we can see that 6,916 pairs of tokens are placed into a file called “Paired.txt”. This file is located in the Data Folder. These 6,916 pairs are developed using a combinatorial program developed by OSI for this purpose.
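The exact combinatorial rule used by the Warehouse Browser is not spelled out in this exercise. One plausible reading, sketched below in Python, is that two tokens are paired whenever they occur with the same fable name in datawh.txt; the count produced this way need not match the 6,916 pairs reported by the Browser, so treat the sketch as an illustration of the idea rather than a reimplementation.

    # Minimal sketch (Python): a plausible pairing rule behind Paired.txt.
    # Assumption: tokens are paired when they share a fable name; the
    # Browser's actual combinatorial program may use a different rule.
    from collections import defaultdict
    from itertools import combinations

    by_fable = defaultdict(set)
    with open("datawh.txt") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")   # delimiter assumed
            if len(parts) >= 2:
                by_fable[parts[1]].add(parts[0])

    pairs = set()
    for members in by_fable.values():
        pairs.update(combinations(sorted(members), 2))

    with open("Paired.txt", "w") as out:
        for a, b in sorted(pairs):
            out.write(a + "\t" + b + "\n")

    print(len(pairs), "pairs written")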
On launching the Technology Browser (see Figure 4) the user will see only the A1 category node and nothing in the Radial Plot window. To import Paired.txt and extract atoms from a parse of the file, we use the following two commands:
Import
Extract
in the command line. Then the user may select the A1 node to see the atoms randomly scattered onto the circle. Now type “cluster 30”. You will see that the distribution very rapidly moves to a spike. Let us look at this a bit closer. Type “cluster 200” to iterate the gather algorithm 200,000 times. You will see that all of the atoms are linked and will move to a single spike (see Figure 4a).
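To give a feeling for what “cluster N” is doing, the Python sketch below imitates the iterated scatter-gather step: atoms are scattered at random angles on the circle, and each gather iteration moves one endpoint of a randomly chosen link a little toward the other, so linked atoms drift together. The step size, iteration count, and tiny link list are illustrative assumptions; the Browsers’ actual update rule is not given in this exercise.

    # Minimal sketch (Python): the scatter and gather steps on the circle.
    import random

    def scatter(atoms):
        # place each atom at a random angle (degrees) on the circle
        return {a: random.uniform(0.0, 360.0) for a in atoms}

    def gather(position, links, iterations, step=0.1):
        links = list(links)
        for _ in range(iterations):
            a, b = random.choice(links)
            # signed shortest angular distance from a to b
            diff = (position[b] - position[a] + 180.0) % 360.0 - 180.0
            position[a] = (position[a] + step * diff) % 360.0
        return position

    links = [("lion", "mouse"), ("mouse", "cat"), ("fox", "grapes")]
    atoms = {a for pair in links for a in pair}
    pos = gather(scatter(atoms), links, iterations=30000)
    print(sorted(pos.items(), key=lambda kv: kv[1]))   # two spikes emerge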
Figure 4 (a, b): A comparison of different kinds of limiting distributions
In Figure 4b we show a limiting distribution from the study of Intrusion Detection System audit logs. So the phenomenon that all atoms move to the same location is an indication of a characteristic of the data set developed from the fable collection. In the foundational theorems of the SLIP theory, this phenomenon is seen as due to the category of all atoms being “prime” with respect to the Analytic Conjecture (see Figure 3). The interpretation is that the fables have highly interrelated concepts, which is of course true.
From the study of other data sets, one might realize that having only one prime is initially disappointing, since multiple primes indicate multiple different characteristics of the data invariance. We also have some theorems on how to “fracture a prime” and produce substructure. Two of the previous Exercises involve prime fractures.
A close inspection of Figure 4a shows how this is done. We bracket out the center of the spike and put this center into the category B1. We then use the “Residue” command to put everything else into the category R. Given that we have removed the core connectivity of the conceptual linkage, we now have the possibility of identifying a small but well-defined prime within the residue.
The user can randomize the A1 category (this should be the only category you have in your SLIP Framework). Just start the cluster process by typing “cluster 10”. If this is not sufficient, then enter “cluster 10” again until you have a cluster that is like Figure 4a. It might be better to catch the gather process early so that the spike is not so well formed. Now take 5 – 10 degrees out of the middle by typing
“x, y” to bracket the region
“x, y -> B1” to bring these atoms into category B1
Typing in a single degree value, between 0 and 360, will draw a red line from the origin to the circumference pointing at that degree.
Now click on the node A1 and type “residue” in the command line. This will produce the R category.
Now randomize the R category by typing “random”. Cluster just a bit to find a new small cluster. You are looking for a cluster of between 10 and 50 atoms that forms a well-defined spike.
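The bracketing and residue steps above amount to selecting atoms by their angle on the circle. As a rough Python illustration (using an angle dictionary like the one in the previous sketch, with made-up values), “x, y -> B1” collects the atoms whose angles fall between x and y degrees, and “residue” collects everything else:

    # Minimal sketch (Python): bracketing a degree range and taking the residue.
    # The positions below are invented; in the Browser they come from the gather step.
    def bracket(pos, x, y):
        lo, hi = min(x, y), max(x, y)
        return {a for a, deg in pos.items() if lo <= deg <= hi}

    def residue(pos, selected):
        return {a for a in pos if a not in selected}

    pos = {"lion": 174.0, "mouse": 176.5, "cat": 178.0, "fox": 12.0, "grapes": 14.0}
    B1 = bracket(pos, 170, 180)   # the 5-10 degrees taken out of the spike
    R  = residue(pos, B1)         # everything else goes into the R category
    print("B1:", sorted(B1))
    print("R: ", sorted(R))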
Figure 5: A well-defined spike in the residue
The user might have to try several times to get a small prime. Starting over is possible by closing the Browsers, making a fresh copy of the folder that holds the Browsers and the Data folder, and then deleting the A1 folder. One then needs to re-import Paired.txt and extract the atoms.
Taking the spike forms C1; use the indicator command. C1 may have atoms that are not closely related to the main body, so we may re-cluster C1 and move the spike into D1.
Figure 6 (a, b): Random scatter into the Object Space of D1
Once any node has been defined, we may launch the Event Browser to look at that node’s atoms and event chemistry. In Figure 6b, we see randomly scattered atoms from category D1.
Launching the Event Browser requires that we locate the Members.txt file inside the folder corresponding to the node we wish to look into. In future versions the Event Browser will be launched from the Technology Browser, and this file selection will be automatic.
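Until that automation is in place, a few lines of Python can help locate the right Members.txt. The sketch assumes the folder layout described above (a folder named after each node, e.g. D1, containing its Members.txt); the SLIP Framework directory is whatever folder you unzipped the Browsers into.

    # Minimal sketch (Python): find the Members.txt for a chosen node.
    # Assumption: each node has a folder of the same name holding Members.txt.
    from pathlib import Path

    framework_dir = Path(".")      # adjust to your SLIP Framework folder
    node = "D1"                    # the node whose atoms we want to inspect
    members = framework_dir / node / "Members.txt"

    if members.exists():
        print("Open this file in the Event Browser:", members.resolve())
    else:
        print("No Members.txt found for node", node)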
Figure 7: Random scatter into the Object Space of F1